Version: V11

Web Content Extractor Node

The Web Content Extractor node fetches HTML content from web URLs and extracts clean, readable text for downstream processing. It uses heuristic content detection to focus on main article areas while filtering out navigation, advertisements, and boilerplate elements. Multiple URLs are processed concurrently for improved performance.

How It Works

When the node executes, it receives search results containing URLs and fetches each web page concurrently over HTTP. It then parses the HTML to identify main content areas, removes unwanted elements such as scripts and navigation, and extracts clean text formatted for consumption by language models or analysis nodes. The extracted content is added alongside the original search snippets in the output, providing comprehensive information instead of brief summaries.
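
The node's internal implementation isn't published, but the per-URL step can be sketched with common Python libraries (requests and BeautifulSoup here are assumptions, and extract_page_text is a hypothetical helper, not the node's actual code):

import requests
from bs4 import BeautifulSoup

def extract_page_text(url, timeout=10):
    """Fetch one page and return readable text; failures yield an empty string."""
    try:
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.RequestException:
        return ""  # the node logs failures and leaves the content field empty
    soup = BeautifulSoup(response.text, "html.parser")
    # Prefer a main article region when one exists, otherwise fall back to the body
    container = soup.find("main") or soup.find("article") or soup.body or soup
    # Strip scripts, styles, and navigation chrome before extracting text
    for tag in container.find_all(["script", "style", "nav", "header", "footer"]):
        tag.decompose()
    return container.get_text(separator="\n", strip=True)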

The node identifies main content using common HTML patterns like <main>, <article>, and standard content container classes, then removes hidden elements, scripts, styles, and other non-content tags before extracting text. The extraction process handles multiple URLs concurrently using thread pools, with configurable timeouts to prevent hanging on slow or unresponsive websites. Failed requests are logged but don't stop processing of other URLs.
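
A minimal sketch of that concurrency model, reusing the hypothetical extract_page_text helper above (the worker count and error handling shown are assumptions, not the node's documented internals):

from concurrent.futures import ThreadPoolExecutor

def extract_all(results, timeout=10, max_workers=10):
    """Fetch every result URL in parallel and attach the extracted text."""
    urls = [item["url"] for item in results]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        texts = list(pool.map(lambda u: extract_page_text(u, timeout=timeout), urls))
    for item, text in zip(results, texts):
        item["content"] = text  # empty string when the request failed or timed out
    return results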

The output maintains the same structure as the input search results, with each result item's content field populated with the extracted text. Content is automatically cleaned by collapsing excessive whitespace, normalizing line breaks, and truncating to a configurable maximum length. The node preserves the original title, URL, snippet, and index from search results while adding the full extracted content.
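
The whitespace cleanup can be approximated as follows (an illustrative sketch; the node's exact normalization rules aren't published):

import re

def clean_text(text):
    """Collapse runs of spaces/tabs and excessive blank lines."""
    text = re.sub(r"[ \t]+", " ", text)      # collapse horizontal whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)   # normalize repeated line breaks
    return text.strip()

Truncation to the configured limit is covered under Max Content Length below.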

Configuration Parameters

Input Field

Input Field (Text, Required): Workflow variable containing search results in WebSearchResponse format.

The input must be an object containing a query field (string) and a results array where each result has title (string), url (string), snippet (string), and optional index (number) and content (string) fields.

Example input structure:

{
  "query": "artificial intelligence",
  "results": [
    {"title": "AI Overview", "url": "https://example.com/ai", "snippet": "Brief description", "index": 1}
  ],
  "total_results": 1
}

Output Field

Output Field (Text, Required): Workflow variable where enriched search results are stored.

The output is an array of search result objects with the original fields (title, url, snippet, index) plus a content field populated with extracted full text. If extraction fails for a URL, the content field remains empty.

Example output structure:

[
  {
    "title": "AI Overview",
    "url": "https://example.com/ai",
    "snippet": "Brief description",
    "index": 1,
    "content": "Full extracted article text with main content..."
  }
]

Common naming patterns: enriched_results, web_content, extracted_pages, full_articles.

Max Content Length

Max Content Length (Number, Default: 10000): Maximum characters to extract per page.

Range is 100 to 100,000 characters. Content exceeding this limit is truncated at the nearest sentence boundary. Lower values (1,000-5,000) suit quick summaries or processing many pages; higher values (20,000-50,000) suit comprehensive analysis of individual articles. The limit can also be set dynamically through variable interpolation using the ${variable_name} syntax.
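
The sentence-boundary truncation could look roughly like this (an illustrative Python sketch; the node's exact boundary rule isn't documented):

def truncate_at_sentence(text, max_length):
    """Cut to max_length, then back up to the last sentence-ending period if possible."""
    if len(text) <= max_length:
        return text
    cut = text[:max_length]
    last_period = cut.rfind(". ")
    return cut[:last_period + 1] if last_period != -1 else cut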

Request Timeout

Request Timeout (Number, Default: 10): HTTP request timeout in seconds.

Range is 1 to 60 seconds. When a request exceeds the timeout, the node logs the failure and continues processing other URLs. Failed requests result in empty content fields. Shorter timeouts (3-5 seconds) provide fast-fail behavior for many URLs; longer timeouts (20-30 seconds) accommodate slow servers.

Common Parameters

This node supports common parameters shared across workflow nodes, including Stream Output Response, Streaming Messages, Logging Mode, and Wait For All Edges. For detailed information, see Common Parameters.

Best Practices

  • Connect directly after Search Web nodes to enrich search snippets with full article content for comprehensive analysis
  • Monitor Max Content Length to balance detail with context window constraints; extracting full content from 10 pages at 50,000 characters each can exceed model limits
  • Request Timeout values should match target websites: news sites typically respond quickly (5-10 seconds), academic or government sites may need longer (15-30 seconds)
  • Descriptive variable names like enriched_search_results improve workflow maintainability
  • Implement conditional logic using IF nodes to check for empty content fields and handle missing data when extraction fails for critical URLs (see the sketch after this list)
  • Concurrent processing is fast but may trigger rate limiting on some websites
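
For the conditional-logic practice above, the check an IF node performs on the extractor's output is equivalent to the following (a Python sketch; the IF node's actual condition syntax is product-specific and not shown here):

def urls_with_missing_content(enriched_results):
    """Return URLs whose content field came back empty so a fallback branch can handle them."""
    return [item["url"] for item in enriched_results if not item.get("content")]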

Limitations

  • No authentication headers: Custom HTTP headers for authentication are not supported. Credentials must be included in URL query parameters, or sites must be publicly accessible.
  • Concurrent processing limit: The node processes URLs concurrently with a maximum of 10 parallel workers (or CPU count × 2, whichever is lower); see the sketch after this list. Large result sets are processed in sequential batches.
  • No JavaScript rendering: The node fetches static HTML only and does not execute JavaScript. Dynamically loaded content is not extracted.
  • Content detection heuristics: Main content detection uses common HTML patterns. Websites with non-standard layouts may include navigation or boilerplate in extracted content.
  • SSL certificate validation: The node validates SSL certificates by default. Websites with invalid or self-signed certificates fail to load.
  • No retry logic: Failed HTTP requests are not automatically retried. Network errors, timeouts, or HTTP errors result in empty content for those URLs.
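
For reference, the worker cap described in the concurrent processing limit reduces to the following rule (a sketch, assuming Python; the node's internals aren't exposed):

import os

# Parallel workers: CPU count × 2, capped at 10
max_workers = min(10, (os.cpu_count() or 1) * 2)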